Project Github Repo:

https://github.com/bplee/final_project.git

Table of Contents:

Table of Contents
Introduction
Results
Methods
Discussion
Data Summary
References

Introduction

Biomarkers for the prediction of age have the potential to be extremely powerful in modern biology. While predicting someone’s age might seem strange in a clinical setting where you can just ask the patient their date of birth, quality prediction from a biological dataset guarantees that the effects of aging are leaving measurable as well as targetable signatures encoded in our biology. These age biomarkers may give clinicians insight into the true “biological age” of a patient. Such information could lead to testing to extend human lifespan or specific care for the patients of aging disorders to increase longevity.

Among the most cited biomarkers for aging are a number of epigenetic and immune markers (Sen et al. (2016)). However, this project, based off work done by Fleischer et al. published in Genome Biology in 2018, explores the hypothesis that the transcriptomes of human dermal fibroblasts contain rich features for the prediction of age (Fleischer et al. (2018)). It is well recorded that fibroblasts can hold onto the signatures of a number of age dependent changes, witnessed in the transcriptome and epigenome (Phillip et al. (2017)). Low proliferation rate also suggests that fibroblasts in human skin collect and retain damage that occurs with age (Tigges et al. (2014)). Such features makes the human skin fibroblast a promising candidate for age prediction tasks.

The dataset for this project contained the transcriptomes of 143 samples from human participants aged 1 to 94. The data set is hosted on the Gene Expression Omnibus, accession number GSE113957. A small subset of the data included the transcriptomes of individuals living with Hutchinson-Gilford progeria syndrome (HGPS), a premature aging syndrome which leads to a shortened lifespan in humans. Qualitatively, the researchers wanted their model to be able to accurately identify accelerated aging in progeria patients.

This project tested two approaches to the prediction task, a regression baseline and an linear discriminant analysis (LDA) ensemble classifier model. Both models showed predictive promise, with the LDA ensemble model scoring lower mean squared errors (MSE’s) than the linear regression model on both the training and testing set (24 data points each).

Results

For the prediction task, the LDA ensemble model performed slightly better than the linear regression model on the testing set, and much better on the training set in terms of MSE. The MSE results are reported in the table below.

Training Set Testing Set
Linear Regression 156.438 433.34
LDA Ensemble 107.5 420.33

The plots featuring the actual ages vs predicted ages for all models and datasets are shown in the following Methods section.

Ultimately, the success of the LDA ensemble method comes down to its discretization of the prediction space. This allows for the individual classifiers tune more to an age-bin specific set of rich features, as opposed to attempting to resolve average quality prediction across all ages.

The age-bin widths and overlap utilized in the model were chosen to be as small as possible given the distribution of ages in the training set. However, bins of length 20, and separated every 6 years did not feel like it had enough resolution. Had only the training set selected been slightly more favorable, to allow for smaller and closer age bins, the LDA ensemble method might have had slightly better resolution for predictions, and possibly better performance on the test set. But as always, with more data comes questions regarding the construction of this discretization. Should all age bins be the same size? How do hierarchical ensembles perform? Attempting to answer these questions regarding the efficacy of ensemble models in biologic data sets could be difficult for a couple reasons. The data sets pertaining to ages are “small” in the eyes of the machine learning community, and because the underlying affects of aging on human biological proceses is not well understood, it may be a long time before we have enough data to truly understand the strengths or limitations of ensemble classification on biological datasets.

Methods

Reproducibility note: The versions of all software packages used in the preprocessing and analysis are listed in a table at the end of the paper.

Preprocessing:

Data

The data was accessed from GEO via the accession number GSE113957. The data contained 143 samples from persons age 1-94, and included 10 sample from individuals with HGPS. Most of samples were run on a NextSeq 500, however some were run on an Illumina HiSeq 2500. Each sample was annotated with 4 features: age, ethnicity, gender, and disease condition. A two-way table describing the size of the groups across gender and disease condition is listed below, accompanied with a figure depicting the distribution of ages within each group.

HGPS Normal
Female 7 34
Male 3 99

Due to storage constraints, only a subset of the data (48 points) was downloaded and prepared for future analysis from SRA using the sra-toolkit package. This subset was taken only from the “normal”, males group to avoid introducing unnecessary variance from different classes in the training data. The accession numbers for the training and testing sets used in this report can be found in the GitHub in the meta_data_files/ directory, with the names normal_M_uniform_ages.txt and normal_M_uniform_ages_validation.txt respectively. The age distributions were suppose to be uniform to ensure good coverage of the range of ages seen in the complete dataset, however, due to a bug, the distributions were not completely uniform. The distributions for both sets are shown below.

Trimgalore

The reads were passed through trim_galore to remove the appearance of adapter sequences in the data. The data was passed to the trim_galore command with default settings, allowing the program to auto-detect the adapters being used.

Experiment QC

Following trimming, the trimmed files were piped into fasqc with default settings for quality control. The reports were aggregated with multiqc, and inspected. The multiQC reported that most of the samples failed the sequence duplication levels, probably as a consequence of biased PCR duplication. Without the use of UMI’s or spike ins, we made no attempt to mitigate the variance brought about by the sequence duplication levels.

Alignment via STAR

Alignment to the human genome required a human genome and a GTF file (downloaded from UCSC Genome Browser). An index was created with an --sjdbOverhang set to 74, because the majority of reads were of length 75. The GTF file required manual transformation of the chromosome annotations to the annotations used in the FASTA file.

Constructing Count Matrix

Construction of the count matrix was performed using the featureCounts function from the package subread, and aligned to the GTF file downloaded from ensembl using the default parameters. The end result yielded 60676 genes total for two sets of 24 samples.

Count Normalization

All further analysis was performed locally in R. The counts matrix was inspected and normalized with the DESeq2 package offered through BioConductor. The count data was set up within DESeq object, one for the training set and one for the testing set. All genes with an average count less than 1 in either the training or testing set were taken out of both sets, leaving 19964 genes. Estimation of size factors and normalization was performed independently for both data sets, and rlog transformations were also performed independently. Below I show the effects of the normalization for the training data.

Dimensionality Reduction

Principle components analysis (PCA) was conducted on the training set for visualization. The data from the testing set was then projected onto the principle components of the training set, to see if the gradient of ages still held along the first PC.

Differential Gene Expression Analysis

Differential gene expression (DGE) analysis was also conducted with DESeq2 by calling the DESeq function, then the results function. Further filtering was performed by taking only the genes with significant adjusted p-values (\(\alpha = 0.05\)). This left a set of 808 genes, upon which another PCA was performed on the training set, yielding a 24 by 24 matrix that was then used for all downstream analyses.

Prediction:

Linear Regression

As a baseline measure, linear regression was performed in R using the lm function. I chose to regress from the first 5 principle components arbitrarily. While there are plenty of performance metrics available to gauge the performance, I chose to use mean squared error (MSE). The MSE’s for the training and testing predictions were 156.438 and 433.34 respectively. The slope coefficients for the first and second principle components were significant with p-values of \(9.01*10^{-7}\) and 0.0024, respectively.

Ensemble LDA

The ensemble LDA approach was also implemented in R. This approach was chosen over alternative classification approaches (k-means, random forest, etc.) because it reportedly performed better.

The approach utilized a number of binary LDA classifiers tasked with predicting whether or not a certain sample came from a particular “age bin” (eg. does this sample come from samples ages 0 to 10 or not?). The model is constructed by assembling many of these classifiers with age bins that overlap (eg. 0-10, 1-11, 2-12, etc.). Prediction occurs by passing one sample’s features to each of the classifiers and receiving the predictions as to whether or not the sample is predicted in or out of a given classifier’s age bin. Those that hit positively submit one vote for each age in their age-bin. The final prediction is the age with the highest number of votes. In the case of a tie, the average was returned.

Due to sparsity issues in the training set, the bins chosen for this model were 20 years in length, and spaced out every 6 (0-20, 6-26, 12-32, …). The MSE of the training predictions was 107.5, and the MSE of the testing predictions was 420.33.

Discussion

The high levels of sequence duplication raise the concern that the methods attempted in this project might not generalize in the real world as well as illustrated in the testing set. Without the use of spike-ins or UMI’s to help normalize such occurrences, the quality of the analysis is compromised, and there is little than can be done to address this.

On a different note, a positive take away from the project was that I found the ensemble method to be surprisingly powerful for how explainable it was. Repeatedly in biology, we observe strong non-linearities in the data sets we analyze, and while the advent of deep-learning can promise to resolve these non-lineraities with incredible accuracy, the tradeoff is a lack of explainability and disregard for the intuition of the methods that give rise to the non-linearities. I argue that the discretization of regression tasks into classification tasks is a perfect example of balancing the worlds of machine learning and biology. While non-linearities are considered and accounted for in the construction of age specific bins, the simplicity of the unit classifier helps retain the intuition behind mechanisms, and allows researchers to peer inside the black box and retrieve understanding.

Data Summary Table

Raw Read Counts Filtered Genes DGE Genes Dimension Reduction
Testing Set 24 x 60676 24 x 19964 24 x 808 24 x 24
Training Set 24 x 60676 24 x 19964 24 x 808 24 x 24

Software Package Version Table

Software Package Load Command Version
sra-toolkit spack load sra-toolkit 2.8.2-1
trimgalore spack load -r trimagalore 0.4.4_dev
fastqc spack load fastqc v0.11.7
multiqc spack load -r py-multiqc 1.7 (7d61ef5)
STAR spack load star@2.7.0e 2.7.0e
featureCounts spack load subread 1.6.2

References

Fleischer, Jason G, Roberta Schulte, Hsiao H Tsai, Swati Tyagi, Arkaitz Ibarra, Maxim N Shokhirev, Ling Huang, Martin W Hetzer, and Saket Navlakha. 2018. “Predicting Age from the Transcriptome of Human Dermal Fibroblasts.” Genome Biology 19 (1). Springer: 221.

Phillip, Jude M, Pei-Hsun Wu, Daniele M Gilkes, Wadsworth Williams, Shaun McGovern, Jena Daya, Jonathan Chen, et al. 2017. “Biophysical and Biomolecular Determination of Cellular Age in Humans.” Nature Biomedical Engineering 1 (7). Nature Publishing Group: 0093.

Sen, Payel, Parisha P Shah, Raffaella Nativio, and Shelley L Berger. 2016. “Epigenetic Mechanisms of Longevity and Aging.” Cell 166 (4). Elsevier: 822–39.

Tigges, Julia, Jean Krutmann, Ellen Fritsche, Judith Haendeler, Heiner Schaal, Jens W Fischer, Faiza Kalfalah, et al. 2014. “The Hallmarks of Fibroblast Ageing.” Mechanisms of Ageing and Development 138. Elsevier: 26–44.